ggplot2 is part of the tidyverse. It implements a “grammar of graphics” (the gg): you build a plot from separate elements that you add in layers, and, like verbs and adjectives changing the meaning of a sentence, each layer changes the visualization in some way. In base R you can’t build up a plot by simply adding elements with +. There is of course no shame in using the base plotting functions; sometimes they offer more liberty and are handier. Often, however, ggplot2 will handle complex visualizations more quickly, and we can exploit that to do more with our graphs.
Below is the general structure of a ggplot2 call (the template from r4ds).
ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(
    mapping = aes(<MAPPINGS>),
    stat = <STAT>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>
We can specify a number of aspects with this grammar.
Main elements of the ggplot2 grammar of graphics
* Data: acceptable formats are data.frame or tibble
* Geometry: geom_ functions, like geom_point(), geom_line()
* Stats: stat_ functions for statistical transformations, like stat_summary()
* Aesthetics: aes(), for mapping variables to visual properties
* Facets: facet_ functions for creating multi-panel plots, with facet_wrap() or facet_grid()
* Coordinates: coord_ functions for adjusting the coordinate system, e.g. coord_flip(); axis scales are adjusted with scale_ functions, e.g. scale_x_log10()
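To make the template concrete, here is a minimal sketch that fills each slot once. It assumes the built-in mpg dataset that ships with ggplot2 (not the finch data), and the particular choices of geom, coordinate function and facet are only illustrative.

```r
library(ggplot2)

# Each line fills one slot of the template, using the built-in mpg data
p <- ggplot(data = mpg) +                  # <DATA>
  geom_point(                              # <GEOM_FUNCTION>
    mapping = aes(x = displ, y = hwy),     # <MAPPINGS>
    stat = "identity"                      # <STAT>: plot the values as they are
  ) +
  coord_flip() +                           # <COORDINATE_FUNCTION>: swap x and y
  facet_wrap(~ drv)                        # <FACET_FUNCTION>: one panel per drive type

p
```

In everyday use most slots are left at their defaults; typically only the data, the mapping and the geom need to be spelled out.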
We are going to use the famous Darwin’s finch dataset to explain how ggplot2 works.
See the data card for Darwin’s Finch Evolution Dataset on Kaggle.
(We are taking inspiration from the Python analysis here.)
# let's load the tidyverse, which holds ggplot
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#setwd('./03_session')
# let's load the finch data; it actually comes as four datasets
beak_t1 <- read.csv("./data/finch_beaks_1975.csv") # beak measurements in 1975 for scandens / fortis
beak_t2 <- read.csv("./data/finch_beaks_2012.csv") # beak measurements in 2012 for scandens / fortis
# fortis<- read.csv("./data/fortis_beak_depth_heredity.csv") # species heredity info
# scandens<- read.csv("./data/scandens_beak_depth_heredity.csv") # other species heredity info
Following up on our data wrangling worksheets, we will find it a lot handier for plotting if we combine and harmonize the datasets.
# Add a new variable to each dataset to indicate the year
beak_t1 <- beak_t1 %>% mutate(year = 1975)
beak_t2 <- beak_t2 %>% mutate(year = 2012)
# Combine the datasets using bind_rows (dplyr::bind_rows())
finch_data <- bind_rows(beak_t1, beak_t2)
# View the combined dataset
head(finch_data)
## band species Beak_length.mm Beak_depth.mm year
## 1 2 fortis 9.4 8.0 1975
## 2 9 fortis 9.2 8.3 1975
## 3 12 fortis 9.5 7.5 1975
## 4 15 fortis 9.5 8.0 1975
## 5 305 fortis 11.5 9.9 1975
## 6 307 fortis 11.1 8.6 1975
Let’s start with a simple scatterplot to explore the relationship between beak length and depth.
# check the dataset is of the right format: here data.frame
class(finch_data)
## [1] "data.frame"
# let's select the fortis data to plot
fortis_data <- finch_data %>% filter(species == "fortis")
# we start by specifying the dataset
ggplot(fortis_data,
       # aesthetic layer: we specify what is the x-axis variable, and the y-axis variable
       aes(x = Beak_length.mm, y = Beak_depth.mm)
) +
  # we specify that we want dots
  geom_point()
# Try plotting the 'scandens' data now based on the above
We can also readily display the data for scandens and fortis in the
same plot (a bit messy) by mapping species to a grouping aesthetic such as
colour (the group aesthetic alone has no visible effect with geom_point()),
to see the wide differences in beak shape between species.
ggplot(finch_data,
       # aesthetic layer: we separate groups by colour
       aes(x = Beak_length.mm, y = Beak_depth.mm, colour = species)) +
  geom_point()
Try to produce a boxplot of beak lengths by species with
geom_boxplot().
What if we now want to split by year too to explore (relatively) short term evolution? Faceting allows you to create multiple panels for subsets of your data.
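As a minimal sketch of faceting, the example below uses the built-in mpg dataset, which happens to have a year column too (1999 and 2008). This is a stand-in so the code runs on its own; assuming finch_data was built as above, the same facet_wrap(~ year) call should split the finch scatterplot into a 1975 and a 2012 panel.

```r
library(ggplot2)

# mpg ships with ggplot2 and has a `year` column with two values
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  # one panel per value of `year`
  facet_wrap(~ year)

p
```

By default all panels share the same axes, which makes differences between years easy to compare by eye.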
You can also create a grid of facets. This way we can create a 2D grid with one factor determining the row and another the column.
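A facet grid can be sketched in the same way. The example again uses the built-in mpg dataset as a stand-in, with drive type in the rows and year in the columns; for the finches, and assuming finch_data as built above, the analogous call would be facet_grid(species ~ year).

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  # formula is rows ~ columns: drive type in rows, year in columns
  facet_grid(drv ~ year)

p
```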
Now labels are essential for making your plots informative.
ggplot(
  fortis_data, # the fortis subset created earlier
  aes(x = Beak_length.mm, y = Beak_depth.mm)) +
  geom_point() +
  # Adding labels to a plot
  labs(
    x = "Beak length in mm", # x-axis label
    y = "Beak depth in mm", # y-axis label
    title = "Finch Beak Size by Beak Depth in Fortis",
    caption = "Source: Kaggle Finch Dataset" # a good place to put the source of the data, for instance
  )
You can also label individual points:
# Label points with geom_text
ggplot(mpg, aes(x = displ, y = hwy, label = model)) + geom_text()
Use ggsave() to save your plots with high resolution.
ggsave("finch_plot.png", dpi = 600, height = 4, width = 5, units = "in")
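When no plot is given, ggsave() saves the last plot that was displayed, which is easy to get wrong in a longer script. A minimal sketch of a more explicit pattern is to store the plot in an object and pass it through the plot argument. The mpg data, the object name p and the temporary file name below are illustrative assumptions, not part of the worksheet.

```r
library(ggplot2)

# Store the plot in an object instead of relying on "the last plot displayed"
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Hypothetical output path in a temporary folder, so nothing gets overwritten
out_file <- file.path(tempdir(), "example_plot.png")

# Pass the plot explicitly; size is in inches, resolution in dots per inch
ggsave(out_file, plot = p, dpi = 600, height = 4, width = 5, units = "in")

file.exists(out_file)
```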
Themes allow you to customize the appearance of your plots in one go.
# Apply a black-and-white theme
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
theme_bw()
We can further customize themes, here by removing the grid lines and increasing the font size.
# Black-and-white theme with a larger base font and no grid lines
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  theme_bw(base_size = 16) +
  theme(
    # remove grid lines by setting the panel.grid elements to element_blank()
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank()
  )
You can also change the font. Being able to change the font size in all elements of a plot by changing one argument is really handy when preparing plots for posters or publications.
# Change font to serif
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(size=8, alpha=.2) +
theme_bw(base_size = 30) +
theme(text = element_text(family = "serif"))
# Exercise
# Check ?theme_bw() and ?theme() to find out about theme_bw() and general theme parameters
# 1. assign the plot to a handle h
# 2. try different pre-defined themes with h+theme_*
# 3. change the default font family to "sans" (sans serif)
# 4. change the "face" of the font to "italic"
# 5. change the font size and dots' size so it is visible from 3 meters away (hint: size parameter for dots)
# 6. add transparency to the dots so we can appreciate their density (hint: alpha parameter for dots)
# Alpha adjusts the transparency with a range of 0-1 with 1 being entirely opaque, 0 invisible
Create xxx in the chunk below.
# Try using the formatting conventions
# remember the + separating graph layers and the , separating parameters within a function
The R Graph Gallery website shows you the huge range of styles supported by ggplot2.
# 1. Go to the R Graph Gallery website
# 2. Browse through chart types
# 3. See whether you can change chart code to explore the Finch dataset
Key Takeaways
ggplot2’s grammar of graphics allows for flexible and layered visualizations.
Faceting and aesthetics make it relatively easy to explore and display data subsets.
Themes and high-resolution saving ensure publication-quality plots.
Additional resources
See the official documentation, including the cheat-sheets (overwhelming for my taste).